arXiv: 1509.04903vl [stat.AP] 16 Sep 2015 


The Annals of Applied Statistics 

2015, Vol. 9, No. 2, 1076-1101 

DOI: 10.1214/15-AOAS829 

(c) Institute of Mathematical Statistics, 2015 


WAVELET-DOMAIN REGRESSION AND PREDICTIVE 
INFERENCE IN PSYCHIATRIC NEUROIMAGING 

By Philip T. Reiss^’*’^ Lan Yihong Zhao*, 

Clare Kelly^’* and R. Todd Ogden^’^ 

New York University*, Nathan S. Kline Institute for Psyehiatrie Research) 

and Columbia University^ 

An increasingly important goal of psychiatry is the use of brain 
imaging data to develop predictive models. Here we present two con¬ 
tributions to statistical methodology for this purpose. First, we pro¬ 
pose and compare a set of wavelet-domain procedures for fitting gen¬ 
eralized linear models with scalar responses and image predictors: 
sparse variants of principal component regression and of partial least 
squares, and the elastic net. Second, we consider assessing the con¬ 
tribution of image predictors over and above available scalar pre¬ 
dictors, in particular, via permutation tests and an extension of the 
idea of confounding to the case of functional or image predictors. Us¬ 
ing the proposed methods, we assess whether maps of a spontaneous 
brain activity measure, derived from functional magnetic resonance 
imaging, can meaningfully predict presence or absence of attention 
deficit/hyperactivity disorder (ADHD). Our results shed light on the 
role of confounding in the surprising outcome of the recent ADHD- 
200 Global Competition, which challenged researchers to develop al¬ 
gorithms for automated image-based diagnosis of the disorder. 

1. Introduction. A major goal of current psychiatric neuroimaging re¬ 
search is to predict clinical outcomes on the basis of quantitative image 
data. Many studies have focused on “predicting” current disease states from 
brain images [e.g., Craddock et al. (2009), Sun et al. (2009)]. While seem¬ 
ingly less difficult than accurate prediction of future outcomes, the goal of 
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clinically useful imaging-based diagnosis has proved highly challenging [Ka¬ 
pur, Phillips and Insel (2012), Honorio et al. (2012)]. 

This paper addresses two important limitations of standard methods for 
using brain images to predict psychiatric outcomes: 

(i) Ordinarily, the voxels (volume units) of the brain are treated as inter¬ 
changeable predictors or “features.” Accuracy might be improved by prop¬ 
erly exploiting the spatial arrangement of the brain. 

(ii) In some cases brain images may prove successful for diagnostic clas¬ 
sification, but only because the images are related to one or more scalar 
covariates that drive the association. This is a nonstandard form of con¬ 
founding, and there seems to be no existing methodology for detecting it. 
In other words, little is known about how to assess whether image data of¬ 
fers “added value” for prediction, beyond what is available from nonimage 
data—which will typically be much simpler to acquire. 

To address limitation (i), we approach the general problem as one of re¬ 
gressing scalar responses on image predictors, which are viewed, as in Reiss 
(2006) and Reiss and Ogden (2010), as a challenging special case of func¬ 
tional predictors [Ramsay and Silverman (2005)]. The responses yi,...,yn 
are assumed to be generated independently by the model 

( 1 ) EF{yi,(f)), 

(2) g(fj,^) = tld + j Xi{s)l3{s)ds. 

Here EF{pi,4>) denotes an exponential family distribution with mean yi and 
scale parameter (j), along with a link function y, tj is an m-dimensional vec¬ 
tor of (scalar) covariates, of which the first is the constant 1; Xi'.S —)■ M 
is a functional predictor with domain 5 C or C ; and the correspond¬ 
ing effect, the coefficient function or coefficient image ffi.S —>■ M, is the 
parameter of interest. The simplest special case is the linear model 

(3) yi = tf5 + J Xi{s)f3{s)ds + £i, 

where the e* are independent and identically distributed errors with mean 0 
and variance (T^(= (p). When tj = 1 (i.e., no scalar covariates), model (3) is 
the extension, from one-dimensional to multidimensional predictors, of the 
functional linear model that has been studied by Marx and Eilers (1999), 
Cardot, Ferraty and Sarda (1999), Muller and Stadtmiiller (2005), Ramsay 
and Silverman (2005), Hall and Horowitz (2007), Reiss and Ogden (2007), 
Goldsmith et al. (2011) and many others. 

For the case of one-dimensional functional predictors, a popular way to 
take spatial information into account is to restrict /!(•) to the span of a spline 
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basis [e.g., Marx and Eilers (1999)]. Spline methods for two-dimensional 
predictors have been studied by Marx and Eilers (2005) and Guillas and 
Lai (2010), and by Reiss and Ogden (2010), whose work was motivated by 
neuroimaging applications. 

Some more recent work has considered neuroimaging applications with 
two- and three-dimensional predictors [Zhou, Li and Zhu (2013), Goldsmith, 
Huang and Crainiceanu (2014), Huang et al. (2013)]. In this paper, we pro¬ 
pose a set of new approaches based on a wavelet representation of the coeffi¬ 
cient image. The idea of transforming the images to the wavelet domain has 
previously appeared in the brain mapping literature, where it is customary to 
fit separate models at each voxel, with the image-derived quantity regressed 
on demographic or clinical variables of interest [e.g., Ruttimann et al. (1998), 
Van De Ville et al. (2007)]. But for our objective of using entire images in a 
single model to predict a scalar response, working in the wavelet domain has 
been mentioned as a natural idea [Grosenick et al. (2013)] but rarely if ever 
pursued, at least until the very recent work of Wang et al. (2014). Unlike 
spline bases, wavelet bases are designed for sparse representation and yield 
estimates that adapt to the features of the coefficient image. 

Limitation (ii) was highlighted by the results of the recent ADHD-200 
Global Gompetition for automated diagnosis of attention deficit/hyperactiv¬ 
ity disorder [ADHD-200 Gonsortium (2012)]. Teams were provided with 
functional magnetic resonance images from ADHD subjects and controls 
on which to train diagnostic algorithms, and then applied these algorithms 
to predict diagnosis in a separate set of images. A team of biostatisticians 
from Johns Hopkins University, whose methods are described by Eloyan 
et al. (2012), achieved the highest score for correct imaging-based classifi¬ 
cation and were declared the winners. But a team from the University of 
Alberta, which discarded the images and used just four scalar predictors 
[age, sex, handedness and IQ; see Brown et al. (2012)], attained a slightly 
higher classification score [see Caffo et al. (2012) for related discussion]. 

To address limitation (ii), we test the effect of image predictors via a 
permutation-based approach originally proposed in the neuroimaging lit¬ 
erature [Golland and Fischl (2003)], which we extend to allow for scalar 
covariates. We also consider how to extend the traditional notion of con¬ 
founding to settings with both scalar and image predictors. These ideas are 
illustrated using our wavelet methods, but are not specific to them; rather, 
they are applicable with other approaches to functional or high-dimensional 
regression. 

Our contributions can be summarized as follows: (i) We propose novel 
wavelet-domain methodology for regression with image predictors. While 
Wang et al. (2014) and Zhao, Chen and Ogden (2015) have studied the 
wavelet-domain lasso for image predictors, we also propose and compare 
several other methods, and consider the generalized linear case and the role 
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of scalar covariates, (ii) We extend predictive performance-based hypothesis 
testing [Golland and Fischl (2003)] to the case where scalar confounders 
are present, providing a new way to assess the usefulness of image-based 
prediction. 

In Section 2 we introduce wavelet bases, and motivate and outline a gen¬ 
eral template for scalar-on-image regression in the wavelet domain. Section 3 
describes three specific algorithms, which are evaluated in simulations in Sec¬ 
tion 4. Section 5 considers hypothesis testing and confounding with image 
predictors. In Section 6 the proposed methods are applied to a portion of 
the ADHD-200 data set, and the results point to a possible role of con¬ 
founding in the competition’s surprising result. Section 7 offers a concluding 
discussion. 


2. Wavelets and their use in regression on images. 


2.1. A brief introduction to wavelet basis representations. Wavelet bases 
are a popular way to obtain a sparse representation for functional data, in 
particular, when the degree of smoothness exhibits local variation [see Ogden 
(1997), Vidakovic (1999), Nason (2008) for statistically-oriented treatments]. 
A wavelet basis for L^(]R) is constructed from a scaling function (or “father 
wavelet”) 4> and a wavelet function (“mother wavelet”) ^ [see Figure 1(a)- 
(b)], with the following properties: 

• For each j G Z, the shifted and dilated functions {(pj^kix) = — 

fc) ; A: G Z} form an orthonormal basis for Vj, where • • • C V-i C Vq C Fi C 
• • • is a nested sequence of subspaces whose union is a dense subspace of 
L2(M). 

• For each j G Z, — A:): A: G Z} form an orthonormal 

basis for a “detail space” Wj satisfying = 1^ © Wj. 

Hence, for any integer jq > 0, Vj^ © Wj^ © IFjp+i © ■ • • is a dense subspace of 
L2(M). 

Given appropriate boundary handling, such as modifying the scaling and 
wavelet functions to be periodic, one can likewise construct orthonormal 
wavelet bases for L^[0,1], of the form 




eD 


JO 


eW. 


JO 


€Wjg + l 


—that is, scaling functions (corresponding to the large-scale features of 
the data), 2^° wavelet functions at level jo, 2-^°^^ wavelet functions at level 
jo + 1 and so on, with higher wavelet levels capturing hner-scale details. 
This multiscale structure is what makes wavelet bases so useful for sparse 
representation of functions with varying degrees of smoothness. 
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The wavelet decomposition level jo acts as a tuning parameter. A small jo 
implies that a small number (2-^°) of scaling functions are used to construct 
the macro features of the function, with most of the basis elements dedicated 
to providing detail at a variety of scales. A large jo allows for many more 
scaling functions, each at higher resolution, and thus fewer basis elements 
corresponding to detail. 

In practice, a function / G L^[0,1] is observed at finitely many points, 
ordinarily taken to be the N = 2'^ (for some positive integer J) equally 
spaced points 0, ^,..., (When the function is observed at a number 
of points that is not a power of 2, one can insert zeroes before and after 
to attain the next highest power of 2.) The observed values can then be 
interpolated by the Wdimensional truncated basis 

■ ■ • ) 2 ^ 0 - 1 } U ■ ■ ■) V'jo, 2 ^ 0 - 1 } U • • • U ■ ■ ■ ) 

The discrete wavelet transform (DWT), implemented by the 0{N) pyramid 
algorithm of Mallat (1989), expands / with respect to this basis. Given a 
judicious choice of (j) and tj, signals of varying smoothness can be well repre¬ 
sented with a small number of coefficients. Throughout this paper we use the 
compactly supported Daubechies (1988) “least-asymmetric” wavelets with 
10 vanishing moments, displayed in Figure 1. 

Wavelet bases for two dimensions can be constructed by taking tensor 
products of the cf) and ■0 functions. The two-dimensional scaling function is 
(p{x)(p{y) and there are three types of 2D wavelets: 4){x)'il:{y), ijj{x)4>{y) and 
'0(x)'0(y), roughly corresponding to “horizontal,” “vertical” and “diagonal” 
detail, respectively [see Figure l(c)-(d)]. These functions are dilated and 
translated just as their ID counterparts are. Wavelet bases for 3D are con¬ 
structed similarly. Morris et al. (2011) discuss alternative wavelet transforms 
for images that are not constructed as tensor products. 



(c) (d) 

f 


Fig. 1. (a) Scaling function (j> and (b) wavelet function tp for ID Daubechies (1988) 

“least-asymmetric” wavelets with 10 vanishing moments. 2D basis functions are formed 
from tensor products such as (c) {x,y) 1 -^ ip{x)(l>{y) and (d) {x,y) 'tp{x)tp{y). 
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2.2. A meta-algorithm for scalar-on-image regression. Henceforth, the 
functional predictor Xj(-) of (2), (3) will be replaced by the ith discretized im¬ 
age observation x* = (xi,..., xn)'^ = ..., Xi{sN)]'^, where si,..., sat G 

S are distinct spatial locations at which the function Xi is measured. Often, 
in practice, each image is given as a matrix or 3D array; Xj is then obtained 
by converting this into a vector. From now until Section 3.5 we focus on the 
linear model (3), which can now be written in matrix form as 

(4) y = T5 + X/3 + e. 

Here y = (yi,... £ = (ei,... ,en)^; T is the n x m matrix with ith 

row tf;x is the n X N matrix with ith row xJ ; and f3 = (,01,..., (3n)'^ is 
a similarly discretized version of the coefficient image /3. More precisely, for 
j = 1,..., N, (3j = Wjl3{sj), where the WjA are quadrature weights such that 
xf/3 is a good approximation to the integral in (3); but for image data, 
si,... ,sn typically form an equally spaced grid, so these weights are taken 
as constant and hence ignored. With these dehnitions, (3) is just the ith of 
the n equations that make up the vector equation (4). 

To simplify the notation, we shall use a single subscript and denote the 
wavelet basis functions for a given jo as The wavelet rep¬ 
resentation of the ith observed image Xi is Xi{s) = in which 

the wavelet coefficients are given by Xik = {xi,fk)- The coefficient vector 
Xj = (xji,..., Xjjv)^ can be written as Xj = Wxj, where W is an x or¬ 
thonormal matrix (which is not formed explicitly when Xj is computed by the 
DWT). Similarly the discretized coefficient function f3 can be represented in 
terms of its wavelet coefficients as f3 = WP, leading to the wavelet-domain 
version of model (4): 

y = T5 + X.W'^Wp + e 

(5) 

= T(5 + X/3 + e, 

where X is the n x N matrix with ith row x[. 

The key point is that the wavelet-domain form (5) is better suited than 
the original form (4) for applying sparse techniques for high-dimensional 
regression—both because wavelet bases are designed for sparse representa¬ 
tion of images [Mallat (2009)] and because the DWT approximately decorre¬ 
lates or “whitens” data [Vidakovic (1999)]. We can thus formulate a “meta¬ 
algorithm” for scalar-on-image regression in the wavelet domain: 

1. Apply the DWT to the image predictors to transform model (4) into 
model (5). 

2. Use some high-dimensional regression methodology to derive a sparse 
estimate /3. 
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3. Apply the inverse DWT to {3 to obtain a coefficient image estimate (3 
for the original model (4). 

Different choices for step 2 lead to specihc algorithms, as described in the 
next section. 

The above general scheme can be extended to multiple image predictors 
[cf. Zhu, Vannucci and Cox (2010)]. We note that this meta-algorithm has 
been applied before for ID functional predictors [Brown, Fearn and Vannucci 
(2001), Wang, Ray and Maffick (2007), Malloy et al. (2010), Zhao, Ogden 
and Reiss (2012)] and more for image predictors [Wang et al. (2014), Zhao, 
Chen and Ogden (2015)]. Past work on wavelet-domain classification, as op¬ 
posed to regression [e.g., Berlinet, Biau and Rouviere (2008), Zhu, Brown 
and Morris (2012), Chang, Chen and Ogden (2014)], may bear compari¬ 
son to our proposed methods. Morris et al. (2011) develop wavelet-domain 
functional mixed models with images as responses. 

3. Three wavelet-domain algorithms. 

3.1. Sparse wavelet-domain prineipal component regression. The func¬ 
tional linear model (3) is often fitted by assuming the coefficient function has 
a truncated functional principal component, or Karhunen-Loeve, represen¬ 
tation /l(s) = J2JLi CjPjis), where m is a positive integer and pi,p 2 , ■ ■ ■ ,Pm 
are the first m eigenfunctions of the covariance operator associated with the 
predictor functions Xi [e.g., Cardot, Ferraty and Sarda (1999), Muller and 
Stadtmiiller (2005), Cai and Hall (2006)]. The eigenfunctions pi, P 2 , ■ ■ ■ ■, Pm 
can be estimated by viewing the functional predictors as (highly) multi¬ 
variate data, and applying ordinary principal component analysis to the 
predictor matrix X. 

Here and in Section 3.2, we assume that X has mean-centered columns, 
that is, l^X = 0. The approach of the previous paragraph then amounts to 
assuming (3 = V^T for some 7 € M™', where UDV^ is the singular value 
decomposition of X, and comprises the leading m columns of V. Hence, 
estimation reduces to choosing d ,7 to minimize the principal component 
regression [PCR; Massy (1965)] criterion 

(6) lly-Td-XV^7ll2. 

(This is a slightly nonstandard PCR criterion, in that principal component 
reduction is applied only to X but not to T. A similar remark applies to the 
other criteria introduced below.) 

As shown by Reiss and Ogden (2007), PCR can be implemented more 
effectively by exploiting the functional character of the data. In the one¬ 
dimensional functional predictor case, this has usually meant forming smooth 


P. T. REISS ET AL. 


estimates of the eigenfunctions—as in the FPCRc method of Reiss and Og¬ 
den (2007), which expands the eigenfunctions with respect to a R-spline 
basis [cf. Cardot, Ferraty and Sarda (2003)]. But for image predictors, local 
adaptivity—the ability to capture sharp features in some areas vs. a high 
degree of smoothness elsewhere—becomes particularly important. This mo¬ 
tivates using a wavelet basis, rather than a spline basis, to represent the 
eigenfunctions, or, in other words, developing a wavelet-domain version of 
PCR as an instance of the meta-algorithm of Section 2.2. 

A nonsparse wavelet-domain PCR estimate would minimize 

(7) ||y-T5-XV^7f, 

which is analogous to (6) but based on the SVD of X rather than of X. 
However, the advantage of working in the wavelet domain is to obtain a 
sparse coefficient estimate by replacing the PC weights with weights 
from a sparse version of PCA. Several penalty-based methods have been 
proposed for sparse PCA [e.g., Zou, Hastie and Tibshirani (2006), Shen and 
Huang (2008), Witten, Tibshirani and Hastie (2009)], but we opted for the 
approach of Johnstone and Lu (2009), which is simpler than the penalized 
methods and, unlike them, was developed with a view toward sparse wavelet 
representations of signals. Johnstone and Lu (2009) propose to select the 
features or coordinates with highest variance, and apply PCA only to these. 
The resulting sparse PCR criterion is 

(8) \\y-T6-X*V*^jf- 

here X* consists of the c columns of X having highest variance, and 
consists of the leading m columns of V*, where U*D*V*^ is the SVD of X*. 
The minimizer (5,7) of (8) can be obtained by simple least squares. The 

vector of wavelet coefficient estimates is then /3 = V))j 7 , and the coefficient 
image estimate ^ = W^/3 is derived by the inverse DWT. 

3.2. Sparse wavelet-domain partial least squares. Whereas PCR reduces 
dimension by regressing on the leading PCs of the predictors, partial least 
squares [PLS; Wold (1966)] works by regressing on a set of components that 
are relevant to predicting the responses. A (nonsparse) wavelet-domain PLS 
estimate [cf. Nadler and Coifman (2005)] is derived by minimizing 

(9) ||y-T5-XR„7f 

[cf. (7)], where the columns of R.^ are defined iteratively as follows [Stone 
and Brooks (1990)]: 

• f 1 = arg min||j.||=i Cov(y, Xr); 
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• for j = 2,...,c, 

f j = arg _ _min Cov(y,Xr). 

||r|| = l, Xrm,=0 —1 

Once again, however, the point of working in the wavelet domain is to ob¬ 
tain a sparse estimate. To define sparse wavelet-domain PLS, as with PCR, 
we could have used penalization to derive sparse PLS components [Chun 
and Kele§ (2010)], but we instead opted to build on the aforementioned ap¬ 
proach of Johnstone and Lu (2009) to sparse PCA. A natural PLS analogue 
of that approach is to select those features Xj whose covariance with y has 
the greatest magnitude. This results in the sparse PLS criterion 

(10) ||y-T5-XtRt„7f; 

here Xl consists of the c columns of X having highest covariance with y, 
and the columns of Rm are defined analogously to those of Rm in (9). As for 
PCR, the least-squares minimizer (<5,7) of (10) leads directly to estimates 
of the wavelet coefficients /3 and of the resulting coefficient image f3. 

Our PLS algorithm is a wavelet-domain counterpart of the spline-based 
functional PLS procedure denoted by FPLSc in Reiss and Ogden (2007). 
We note that Preda and Saporta (2005) and Delaigle and Hall (2012b) have 
proposed more explicitly functional formulations of PLS, based on covariance 
operators on function spaces. 

3.3. Wavelet-domain elastic net. Since wavelet bases are well suited for 
sparse representation of functions, recent work has considered combining 
them with sparsity-inducing penalties, both for semiparametric regression 
[Wand and Ormerod (2011)] and for regression with functional or image 
predictors [Zhao, Ogden and Reiss (2012), Wang et al. (2014), Zhao, Chen 
and Ogden (2015)]. The latter papers focused on ii penalization, also known 
as the lasso [Tibshirani (1996)], in the wavelet domain. Alternatives to the 
lasso include the SCAD penalty [Fan and Li (2001)] and the adaptive lasso 
[Zou (2006)]. Here we consider the elastic net (EN) estimator for wavelet- 
domain model (5), which minimizes 

(11) ||y -TS- X^||2 + X[a\mi + (1 - «)||^||i] 

over (<5,/3), for a regularization parameter A > 0 and a mixing parameter 
a G [0,1] which controls the relative strength of the ii and £2 penalties on 
the coefficients [Zou and Hastie (2005)]. 

In the original nomenclature of Zou and Hastie (2005), the minimizer 
of (11) is the “naive” EN, whereas EN is a rescaled version. Since we shall 
make use of the generalized linear extension of EN as implemented by Fried¬ 
man, Hastie and Tibshirani (2010), we follow these authors in omitting the 
rescaling step. When a > 0, the £i penalty shrinks small coefficients to zero, 
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leading to a sparse wavelet representation. The wavelet-domain lasso is ob¬ 
tained when q; = 1. As explained by Zou and Hastie (2005), given a group 
of important features that are highly correlated, the lasso tends to select 
just one, whereas EN selects the entire group, which is often preferable— 
even in the wavelet domain, notwithstanding the “whitening” property of 
the discrete wavelet transform. 

3.4. Summary: Alternative routes to sparsity. All three of the above 
methods seek to represent the coefficient image /?(■) sparsely, as a linear 
combination of a subset of the wavelet basis functions, but they deploy very 
different strategies to choose that subset. The ii penalty in the elastic net 
criterion (11) has the effect of shrinking small coefficients to zero. This can 
be interpreted as imposing a prior that favors a sparse estimate. The PCR 
criterion (8) eliminates basis elements before performing regression, based 
on an implicit assumption that those basis elements with low variance in 
the data have little to contribute to the coefficient image. This assumption 
is broadly consistent, on the one hand, with the assumption of Johnstone 
and Lu (2009) that such basis elements are merely capturing noise; and, on 
the other hand, with the underlying assumption of PCR, namely, that the 
highest-variance principal components are most relevant in regression [see 
Cook (2007) for some relevant discussion]. The PLS criterion (10) likewise 
lets the data determine which basis elements to include; but here, instead 
of considering only the wavelet-transformed image data X as in PCR, we 
define relevant components by iteratively maximizing covariance with the 
responses y. 

3.5. Extension to the generalized linear case. The above three wavelet- 
domain algorithms can be straightforwardly extended from linear to gener¬ 
alized linear models (GLMs) of the form 

(12) g[E{y)]=T5 + ±P, 

for a link function g, generalizing (5). For PCR, one simply fits a GLM, 
as opposed to a linear model, to the sparse PGs. For the elastic net, the 
glmnet algorithm of Friedman, Hastie and Tibshirani (2010) is available for 
the generalized linear case. 

PLS is sometimes performed in an iteratively reweighted manner for 
GLMs [Marx (1996)], but in high-dimensional settings, such algorithms may 
require nontrivial modification [e.g.. Ding and Gentleman (2005)] to avoid 
convergence problems. Here we view PLS as a generic approach to con¬ 
structing relevant components, which may be employed beyond the linear 
regression setting [e.g., Nguyen and Rocke (2002), Delaigle and Hall (2012a)]. 
Thus, we construct PLS components exactly as we would for a linear model, 
but then use these components to fit a GLM. 
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3.6. Tuning parameter selection. For wavelet-domain PCR and PLS, 
three tuning parameters must be selected: the resolution-level parameter jo; 
the number c of wavelet coefficients to retain [i.e., the number of columns of 
X* in ( 8 ) or of in (10)]; and the number m of PCs or PLS components. 
We generally fix jo = 4, since we have found that resolution level to be gen¬ 
erally either optimal or near-optimal as measured by cross-validation (CV). 
For wavelet-domain elastic net, one must choose jq and the two penalty 
parameters a and A in (11), but we again prefer to fix jo = 4. 

These tuning parameters are chosen by repeated iL-fold CV. In the rth 
of R repetitions we divide the data points (yj,tj,Xj) (i = l,...,n) into K 
equal-sized validation sets indexed by Rp, ■ ■ ■ ,Ir,K- We can then choose the 
tuning parameters to minimize the CV score 

R K 

(13) 

r=l k=l i&Ir,k 

where 6-r,ki /3-r k estimates that result when model ( 12 ) is fitted (by 

PCR, PLS or EN) with the observations indexed by R^k excluded, and L is an 
appropriate loss function. For linear regression the standard loss function is 
the squared error L(yj;5,/3) = (y* — ifS — xf/3)'^. For the generalized linear 
case, following Zhu and Hastie (2004), we use the deviance D{yi; 5, (3) as the 
loss function. Specifically for logistic regression, unusually large summands 
can dominate criterion (13). Therefore, similarly to Chi and Scott (2014), 
we instead choose the tuning parameters by a robust CV score that takes 
the median rather than the mean over each set of K validation sets: 

1 ^ . c 

(14) —V median V D{yi;S_r,k,P-r,k)- 

it —■ k£\l,...,K} 
r=l 

4. Comparative simulation study. To test the performance of our meth¬ 
ods with realistic image predictors, we created a data set based on the 
positron emission tomography (PET) data previously studied by Reiss and 
Ogden (2010). That data set included axial slices from 33 amyloid beta 
maps, from which we extracted a square region of 64 x 64 voxels. To gen¬ 
erate a larger sample of n = 500 images, we applied a procedure similar to 
that of Goldsmith, Huang and Crainiceanu (2014): 

1. We estimated the (vectorized) principal components (eigenimages) 

Pi) • • •) P32 ^ ) 

with corresponding eigenvalues Ai,..., A 32 . 
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Fig. 2. Coefficient images (left) and (right) used in the simulation study. 


2. For i = 1,., 500, we generated the ith simulated predictor image as 
Xj = CjjPj, with the c^-’s simulated independently from the A^(0, Aj) 

distribution. 


In step 1 above we used the sparse PCA method of Johnstone and Lu (2009), 
including the 492 wavelet coefficients having the highest variance. This num¬ 
ber of wavelet coefficients was sufficient to capture 99.5% of the “excess” 
variance, in the sense of Section 4.2 of Johnstone and Lu (2009). 

We used two different true coefficient images /3 G , which are shown in 
Figure 2. The first image is similar to that used by Goldsmith, Huang 
and Crainiceanu (2014). Taking its domain to be [1,64]^, this coefficient 
image is given by = gi— § 2 , where < 71,52 are the densities of the bivariate 
normal distributions 


N 



,10l2 


and 



IOI 2 


respectively. The second image is a two-dimensional analogue of the 
“bumps” function used by Donoho and Johnstone (1994), and many subse¬ 
quent authors, to illustrate the properties of wavelets. 

We then simulated continuous or binary outcomes yi,... ,yn with speci¬ 
fied approximate values of the coefficient of determination R^, in the sense 
detailed in Supplementary Appendix A.l [Reiss et al. (2015)]. We gener¬ 
ated 100 sets of n = 500 continuous outcomes and 100 sets of 500 binary 
outcomes, for each of the values 0.1,0.5. 

We compared the performance of the three wavelet-domain methods de¬ 
scribed in Section 3 with three analogous “voxel-domain” methods, that 
is, sparse PCR, sparse PLS and elastic net without transformation to the 
wavelet domain. The wavelet- and voxel-domain methods are denoted by 
WPCR, WPLS and WNet and by VPCR, VPLS and VNet, respectively. 
We also included the R-spline-based functional PCR method (“FPCRij,” or 
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simply FPCR) of Reiss and Ogden (2007, 2010). Tuning parameter selection 
was as described in Supplementary Appendix A.l [Reiss et al. (2015)]. 

Performance was evaluated in terms of estimation error and prediction 
error. Estimation error is defined by the scaled mean squared error (MSE) 
11/3 —/3|p/||/3|p, where /3,/3 are the true and estimated coefficient images. 
Prediction error is dehned using a separate set of outcomes y]",..., y*, gener¬ 
ated from the same conditional distribution as yi,... ,y„. We use the scaled 
mean squared prediction error Y17=iiyi ~ criterion for linear 

regression and the mean of the deviances of y]',..., y* for logistic regression. 

Eigure 3 presents boxplots of the results. In general, all seven methods 
differ only slightly in prediction error. Much greater differences are seen for 
estimation error. Compared with the corresponding voxel-domain methods, 
the estimation MSE for wavelet methods is either roughly equal or clearly 
lower on average, and the variability of the MSE is often much lower. The 
wavelet methods also markedly outperform R-spline-based EPCR. Some- 
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Fig. 3. Estimation error, displayed as log{scaled MSE) (left subfigure), and prediction 
error (right subfigure) in the simulation study. 
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Fig. 4. True coefficient function from the comparative simulation study (top left) 
compared with five training-set coefficient function estimates (for data simulated under 
Iff — 1 setting) based on wavelet-domain lasso (other top panels) and voxel-domain lasso 
(bottom panels). The wavelet-based estimates are reasonably accurate, while each of the 
voxel-domain estimates has about 20-25 scattered voxels with nonzero values. Note the 
unequal scales. 

what contrary to expectation, the superior performance of wavelet methods 
is not clearly more pronounced for than for 

While the wavelet-domain methods do not clearly attain lower estimation 
error than voxel-domain methods for logistic regression with = 0.5, they 
do appear superior for the R? = 0.1 setting (which seems more realistic) 
and for linear regression. Moreover, qualitatively, wavelet-domain modeling 
helps to capture the main features of the coefficient image. Figure 4 displays 
an example of the training-set estimates derived by wavelet-domain lasso 
versus ordinary lasso. The wavelet-domain estimates are clearly more sim¬ 
ilar to each other and to the true coefficient image than are the ordinary 
lasso estimates. 

The wavelet-domain EN appears to have a slight edge overall compared 
with PCR and PLS. For this reason, and because wavelet EN (or at least its 
special case, the lasso) are now somewhat established in the literature [Zhao, 
Ogden and Reiss (2012), Wang et al. (2014), Zhao, Chen and Ogden (2015)], 
the simulations and real-data analyses in the next two sections consider only 
wavelet-domain EN. 

5. Inferential issnes. We now turn to what the Introduction referred 
to as limitation (ii) of predictive analyses in neuroimaging: the need for 
methodology to assess the predictive value of image data, in particular, 
when scalar covariates are present. 

5.1. Permutation testing. Consider testing the null hypothesis /3(s) =0 
in the general model (1), (2), that is, testing the null parametric model 
g{fii) versus the alternative (2). Informally, we are asking whether 

the images have predictive value beyond the information contained in the 
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scalar predictors. We propose a permutation test procedure in which the CV 
criterion (13) or (14) serves as the test statistic. If the true-data CV falls 
in the left tail of the distribution of permuted-data CV values, significance 
is declared. Permutation techniques of this kind have previously appeared 
in the neuroimaging and machine learning literature [Gotland and Fischl 
(2003), Ojala and Garriga (2010)]. 

The way the permutation distribution is constructed depends on the null 
model under consideration. When tj = 1 in (2) (no scalar covariates), one can 
simply permute the responses: that is, we repeatedly reorder the responses as 
..., for some permutation tt, reht the model, and record the CV 
value. For the linear model (3) with scalar covariates, a common approach is 
to permute the residuals from the null parametric model: that is, model (3) is 
refitted repeatedly with the ith response of the form yi + i-jr^i ), where the hats 
refer to fitted values and residuals from the model yi = tJS + Si. For some 
GLMs, however, such pseudo-responses based on permuted residuals are not 
of the correct form (e.g., for logistic regression, they are not binary). One 
can instead form pseudo-predictors, by regressing the predictor of interest on 
the nuisance covariates and permuting the residuals from this fit. In other 
words, we replace the design matrix (TjX) with 

(15) [T|PrX + n(I-PT)X], 

where Pt = T(T'^T)~^T^ and 11 is a permutation matrix. Although a 
similar idea was proposed by Potter (2005) for (ordinary) logistic regression, 
we have adopted it as our preferred permutation approach even for the 
linear case; see Supplementary Appendix B [Reiss et al. (2015)] for further 
discussion. 

We conducted a simulation study, using the ADHD-200 image data ana¬ 
lyzed in Section 6, to assess the type-I error rate and power of the permu¬ 
tation test procedure. Here we focus on logistic regression (see Supplemen¬ 
tary Appendix C [Reiss et al. (2015)], for linear regression results) and the 
wavelet-domain lasso. We first considered the case without scalar covariates 
and generated binary responses yi ~ Bernoulli(pj), z = 1,..., n = 333, where 

(16) log-^^ = 6o + xfp, 

I-Pi 

where do is a constant used to adjust the base rate (probability of event); 
Xj G is the ith image (expressed as a mean-zero vector); j3 is the true 
coefficient image shown in Figure 5(a) (similarly vectorized), multiplied by 
an appropriate constant to attain a specified value of B? (see Supplementary 
Appendix A [Reiss et al. (2015)], regarding the dehnition of B?). For each of 
the base rates 0.25, 0.5, 0.75 and each of the B? values 0.04, 0.07, 0.1, 0.15, 
0.2, 0.25, 0.3, we simulated 200 response vectors to assess power to reject 
Ho : /3 = 0 at the p = 0.05 level, as well as 1000 response vectors with /3 = 0 
{B? = 0) to assess the type-I error rate. Next we considered testing the same 
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> 

0} 



Fig. 5. (a) True coefficient image (3 used in the power study: gray denotes 0, black 

denotes 1. (b) Estimated probability of rejecting the null hypothesis /3 — 0 as a function of 
, with 95% confidence intervals, for model (16). (c) Same, for model (17). 

null hypothesis for the model 

(17) log—^ = (5o + ti5i + xf/3, 

1 -Pi 

with a scalar covariate ti such that for the submodel E{yi\ti) = ti5 is 
approximately 0.2. We generated the same number of response vectors as 
above for each of the above R? values, but here R? refers to the partial R? 
adjusting for ti (see Supplementary Appendix A.2 [Reiss et al. (2015)]). 

The results, displayed in Figure 5(b) and (c), indicate that the nominal 
type-I error rate is approximately attained for both models. For a given 
R? > 0, the power is somewhat higher for model (16) than for model (17), 
and highest for either model when the base rate is 0.5. Evidently, for base 
rates closer to 0 or 1, the CV deviance under the null hypothesis tends to 
be lower, and thus a stronger signal is needed to reject the null. 

Basing a test of the hypothesis /3(-) = 0 on the prediction performance 
of an estimation algorithm, rather than on an estimate of /3, is admittedly 
somewhat unconventional. In neuroimaging specifically, inference typically 
proceeds by fitting separate models at each voxel, and then applying some 
form of multiple testing correction [Nichols (2012)]. In the present setting 
of a single model that uses the entire image to predict a scalar response, it 
might be possible to assign p-values to individual voxels as in Meinshausen, 
Meier and Biihlmaim (2009). In practice, however, predictive algorithms 
tend to produce rather unstable estimates, as a number of authors have ac¬ 
knowledged [e.g., Craddock et al. (2009), Honorio et al. (2012), Sabuncu, 
Van Leemput and Alzheimer’s Disease Neuroimaging Initiative (2012)]. Our 
hypothesis testing approach thus sets the more modest inferential goal of 
verifying that the coefficient image as a whole yields better-than-chance 
prediction. 

5.2. Confounding. For ordinary, as opposed to functional, regression, 
confounding is said to occur when (i) x appears predictive of y, but this 
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relationship can be attributed to a third variable t such that (ii) t is pre¬ 
dictive of y and (iii) t is correlated with x. For example, birth order (x) is 
associated with the occurrence of Down syndrome (y), but this is due to the 
effect of the confounding variable maternal age (f) [Rothman (2012)]. 

To extend the above definition to the case of a functional predictor x('), 
suppose that (i) x{-) is ostensibly related to y, in the sense that /3(-) is not 
identically zero when model (2) includes no scalar covariates, but (ii) the 
scalar variable t is also predictive of y. A functional-predictor analogue of 
point (iii) is to suppose that t is correlated with f x(s}^(s) ds, where /3(-) 
is an estimate obtained with t excluded from model (2). Aside from this 
“global” analogue of (iii), it may be useful to consider a “local” analogue 
which holds if t is correlated with x(s), specifically for s such that /3(s) / 0; 
but this is somewhat less straightforward to assess. 

6. Application: fALFF and ADHD. 

6.1. ADHD-200 data set and candidate models. We now apply the wa¬ 
velet-domain elastic net to “predicting” ADHD diagnosis using maps of frac¬ 
tional amplitude of low-frequency fluctuations (fALFF) [Zou et al. (2008)] 
from a portion of the ADHD-200 sample referred to in the Introduction 
(http://fcon_1000.projects.nitrc.org/indi/adhd200/). fALFF is defined as the 
ratio of BOLD signal power spectrum within the 0.01-0.08 Hz range to total 
over the entire range. Yang et al. (2011) reported altered levels of fALFF in 
a sample of children with ADHD relative to controls, specifically in frontal 
regions. That study relied on the traditional analytic approach in neuroimag¬ 
ing, which regresses the imaged quantity (in this case fALFF) on diagnostic 
group, separately at each voxel. Here we employed scalar-on-image logis¬ 
tic regression, which reverses the roles of response and predictor, to regress 
diagnostic group on fALFF images. Our sample consisted of 333 individu¬ 
als: 257 typically developing controls and 76 with combined-type ADHD. 
The sample included 198 males and 135 females, with age range 7-20 (see 
Supplementary Appendix D [Reiss et al. (2015)], for further details). We 
chose the 2D slice for which the mean across voxels of the SD of fALFF was 
highest. This was the axial slice located at z = 26 (just dorsal to the corpus 
callosum) in the coordinate space of the Montreal Neurological Institute’s 
MNI152 template (4 mm resolution). We fitted two models. The first was 

(18) logit Pr(zth subject has ADHD) = 5 -|- J Xj(s)/3(s) ds, 

where Xi{s) denotes the fth subject’s fALFF image. The second model was 

(19) logit Pr(fth subject has ADHD) = tfS-\- J Xj(s)/3(s) ds, 

where the vector tj includes the fth subject’s age, sex, IQ and mean FD, as 
well as a leading 1 for the intercept. 
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a = 0.1 a = 0.4 a = 0.7 a=1 



Fig. 6. Coefficient image estimates for model (18) applied to the ADHD-200 data, using 
wavelet-domain elastic net with four different values of the mixing parameter a. 

Figure 6 shows the coefficient images attained for model (18) with each 
value of the mixing parameter a. As expected, increasing values of a lead to 
more-sparse estimates in the wavelet domain, and hence in the voxel domain. 
Figure 7 shows the CV deviance as a function of A for a = 0.1, which had 
the lowest CV deviance overall, as well as for a = 1. 

The left subfigure of Figure 8 shows that the CV deviance lies in the left 
tail of the permutation distribution for model (18), indicating a significant 
effect of the fALFF image predictors (p = 0.015). However, with the scalar 
covariate adjustment of model (19), this effect disappears. The next sub¬ 
section examines more closely how the scalar covariates may be acting as 
confounders. 

Our test of model (18) entailed 999 permuted-data fits with four candidate 
values of a and 100 of A, requiring 14.25 hours on an Intel Xeon E5-2670 
processor running at 2.6 GHz. In practice, we recommend parallelizing the 
permutations via cluster computing to make the computation time more 
manageable. In addition, truncated sequential probability ratio tests [Fay, 
Kim and Hachey (2007)] could in some cases reduce computation time via 
early stopping. We also explored fitting model (18) with the full 3D fALFF 
images as predictors; see Supplementary Appendix E [Reiss et al. (2015)]. 

6.2. Assessing and remedying confounding. As discussed in Section 5.2, 
the notion of confounding entails three elements (see Figure 9). Point (i), an 



X 


Fig. 7. Cross-validated deviance -t/— one approximate standard error, for the wavelet- 
domain elastic net models with a = 0.1,1. 
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Fig. 8. Permutation test results. For the full sample (left), a significant effect of the 
fALFF images is seen in model (18), but not in model (19), which adjusts for scalar 
covariates. When only younger individuals are included (right), neither model shows a 
significant fALFF effect. 

apparent effect of the image predictor fALFF on diagnosis, was established 
by the above permutation test result for model (18). To check point (ii) 
of the definition for each of the four scalar covariates under consideration, 
we performed an ordinary logistic regression with diagnosis (1 = ADHD, 
0 = control) as response and the above four scalar predictors. In Table 1 (at 
left), sex, age and IQ are all seen to be significantly related to diagnosis. 
See also Figure 10, which compares the fitted probabilities from this ordi¬ 
nary logistic regression with those resulting from models (18) and (19). The 
scalar-covariates model is seen to separate the two groups (black vs. gray 
dots) quite well; the image predictors increase the spread of the predicted 
probabilities without clearly improving the two groups’ separation. Based 
on these results, each of these three variables may be acting as a confounder. 

Next we consider point (iii), that is, the correlations of each scalar covari¬ 
ate with Xi(s)id(s) ds, where /3 is the coefficient image estimate from the 
fALFF-only model (18) or, equivalently, with the predicted logit probability 
of ADHD from that model. The results, shown at right in Table 1, point to 
age and sex as the principal confounders. (Here sex was treated as a binary 
variable, with 1 for male and 0 for female; a f-test and a Mann-Whitney 
test yielded similar results.) “Local” examination in the sense of Section 5.2 
reveals that the fALFF x(s) tends to be higher in males and in younger 


Y (ADHD) 



Fig. 9. Relationships among a putative predictor X, outcome Y and confounder T (see 
Section 5.2), illustrated with respect to the ADHD-200 data. 
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Table 1 

To examine element (ii) of confounding, an ordinary logistic regression was fitted with 
the four scalar predictors and with ADHD diagnosis as response; the resulting estimates 
are shown with 95% confidence intervals. For element (iii), we display the correlations of 
each predictor with the logit probabilities estimated by fitting model (18) 



(ii) 


(iii) 


Log odds ratio 

p-value 

Correlation 

p-value 

Intercept 

3.90 (1.11,6.78) 

0.007 



Sex (M-F) 

1.26 (0.65,1.91) 

0.00008 

0.14 (0.03,0.24) 

0.011 

Age 

-0.20 (-0.32,-0.09) 

0.0005 

-0.35 (-0.44,-0.25) 

6 X 10"“ 

IQ 

-0.03 (-0.05,-0.01) 

0.003 

-0.09 (-0.19,0.02) 

0.10 

Mean FD 

-2.51 (-8.80,3.56) 

0.42 

-0.04 (-0.15,0.07) 

0.47 


individuals for many voxels s; and such regions overlap considerably with 
those in which /3(s) >0. In other words, the ostensible association between 
fALFF and ADHD likely reflects the dependence of fALFF on age and sex, 
which in turn are related to ADHD in our sample. 

Further inspection revealed that, of the 67 individuals with age above 
14.0, only 8 had ADHD, with maximum age 17.43—whereas the controls 
had ages as high as 20.45. This led us to suspect that these older individuals 
might be driving the confounding with age that results in a spurious effect of 
fALFF on diagnosis. To investigate this possibility, we repeated the analysis 
using only the 266 individuals of age 14.0 or lower. Figure 8 shows that 
in this subsample, the fALFF effect is no longer significant, even without 



Fig. 10. Predicted probabilities of ADHD diagnosis, according to the images-only model 
(18); an ordinary logistic regression with the four scalar covariates; and model (19), which 
includes both. Also shown are the H? values, as defined in Supplementary Appendix A.l, 
for the three models. 
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adjusting for the scalar covariates. Moreover, given how far the test statistic 
is from the left tail of the permutation distribution, it seems unlikely that 
the loss of significance is due merely to the lower sample size. 

In general, absent careful matching at the design stage, it would be ad¬ 
visable to match the two diagnostic groups optimally on a complete set of 
clearly relevant variables, via algorithms such as those described in Rosen¬ 
baum (2010). Our aim here, however, was to show how a straightforward 
new notion of confounding for functional predictors can be used to identify 
a principal scalar confounder, whose impact can be removed by the crude 
device of simply truncating the age range. 

7. Discussion. Our analysis in Section 6 included only one imaging modal¬ 
ity and only a subset of the individuals from the ADHD-200 Global Compe¬ 
tition database. At any rate, our essentially negative result is consistent with 
the finding [Brown et al. (2012)] that diagnostic accuracy was optimized by 
basing prediction on scalar predictors, while ignoring the image data. In a 
blog comment on that outcome, cited both by ADHD-200 Consortium (2012) 
and by Brown et al. (2012), the neuroscientist Russ Poldrack suggested that 
“any successful imaging-based decoding could have been relying upon corre¬ 
lates of those variables rather than truly decoding a correlate of the disease.” 
Stated a bit differently, the competing teams’ successes in using the image 
data to predict diagnosis may have been brought about by confounding. 
But there appear to have been few attempts, if any, to study systematically 
how confounding may give rise to spurious relationships between quantita¬ 
tive image data and clinical variables. Similarly, analyses of the ADHD-200 
data, and related work on brain “decoding,” have devoted little attention 
to formally testing the contribution of imaging data to prediction of scalar 
responses [but see Reiss (2015)]. 

As we have shown, these two interrelated issues—testing the effect of 
image predictors and investigating possible confounders—can be handled 
straightforwardly within our scalar-on-image regression framework. The per¬ 
mutation test procedure of Section 5.1 found a statistically significant rela¬ 
tionship between fALFF images and ADHD diagnosis, but this disappeared 
when four scalar covariates were adjusted for. Further examination, in light 
of our extension of the notion of confounding to functional/image predictors 
in Section 5.2, pointed to age and sex as the key confounders. 

The ADHD-200 project is one of a number of recent initiatives to make 
large samples of neuroimaging data publicly available [Milham (2012)]. These 
initiatives have been a boon for statistical methodology development, but 
it must be borne in mind that even as neuroimaging sample sizes increase 
rapidly, they remain much smaller than the data dimension. No approach to 
scalar-on-image regression can completely escape the ensuing nonidentifia- 
bility of the coefficient image. We can, however, (i) put forth assumptions, 
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likely to hold approximately in practice, that reduce the effective dimension 
of the coefficient image; and (ii) employ multiple methods in the hope that 
these will converge upon similar coefficient image estimates, at least when 
the signal is sufficiently strong. 

With these considerations in mind, we have introduced three methods 
for scalar-on-image regression, each relying on a different set of assump¬ 
tions to achieve dimension reduction in the wavelet domain. Implementa¬ 
tions of these three methods, for 2D and 3D image data, are provided in the 
refund.wave package [Huo, Reiss and Zhao (2014)] for R [R Development 
Core Team (2012)], available at http://cran.r-project.org/web/packages/ 
refund.wave. This new package, a spinoff of the refund package [Crainiceanu 
et al. (2014)], relies on the wavethresh package [Nason (2013)] for wavelet 
decomposition and reconstruction. 

As discussed in Section 2.2, the three methods described here are merely 
three instances of a meta-algorithm for scalar-on-image regression. The 
refund, wave package allows for straightforward incorporation of alternative 
penalties, and other extensions may allow for more refined wavelet-domain 
algorithms, which may improve the stability and reproducibility of the co¬ 
efficient image estimates [Rasmussen et al. (2012)]. For instance, in wavelet- 
based nonparametric regression, thresholding is often performed in a level- 
specific manner. Analogously, it might be appropriate to modify criterion 
(11) so as to differentially penalize coefficients at different levels. One might 
also employ resampling techniques [cf. Meinshausen and Biihlmann (2010)] 
to select those wavelet basis elements that are consistently predictive of the 
outcome. Finally, wavelets whose domain is anatomically customized, such as 
the wavelets defined on the cortex by Ozkaya and Van De Ville (2011), offer 
a promising new way to confine the analysis to relevant portions of the brain. 
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SUPPLEMENTARY MATERIAL 

Supplementary appendices (DOI: 10.1214/15-AOAS829SUPP; .pdf). De¬ 
scription of simulation details, permutation of residuals for the proposed 
test procedure, a power study, selection of a subsample from the ADHD-200 
data set, and results with 3D predictors. 
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